
Add bulk-sparse native vector scoring for searchable snapshots via DirectAccessInput #144557

Merged
ChrisHegarty merged 33 commits into elastic:main from ChrisHegarty:withByteBufferSlices
Mar 25, 2026
Conversation

@ChrisHegarty (Contributor) commented Mar 19, 2026

This PR builds on the zero-copy DirectAccessInput infrastructure introduced in #141718 to extend native SIMD bulk vector scoring to searchable snapshot (SNAP) data. Previously, native bulk scoring was limited to memory-mapped files (MemorySegmentAccessInput); SNAP inputs fell back to one-at-a-time Java scoring.

During HNSW graph traversal, the search algorithm scores a batch of candidate neighbor vectors against the query vector in each step. With this change, those batches can now be scored in a single native SIMD call even when the underlying data lives in the shared blob cache rather than a memory-mapped file. The new DirectAccessInput.withByteBufferSlices API provides zero-copy access to the cached regions' direct byte buffers, allowing native memory addresses to be extracted and passed directly to the bulk-gather scoring functions without any heap copying. When any vector in a bulk batch crosses a cache region boundary, the entire batch falls back to one-at-a-time scoring. In practice this should be rare: for typical configurations the per-batch fallback probability is well under 1%.
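The boundary-fallback decision and the "well under 1%" claim can be sketched as follows. This is an illustrative model, not the actual Elasticsearch code: the names `crossesRegionBoundary`, `canBulkScore`, and `fallbackProbability` are hypothetical, and the 16 MiB region size is an assumed typical configuration.

```java
// Hypothetical sketch: a batch goes through the native bulk path only if every
// candidate vector lies entirely within a single shared-blob-cache region.
public class BulkFallback {

    /** True if a vector of vectorBytes starting at fileOffset spans two cache regions. */
    static boolean crossesRegionBoundary(long fileOffset, int vectorBytes, long regionSize) {
        long startRegion = fileOffset / regionSize;
        long endRegion = (fileOffset + vectorBytes - 1) / regionSize;
        return startRegion != endRegion;
    }

    /** True if the whole batch can be scored in one native bulk-gather call. */
    static boolean canBulkScore(long[] offsets, int count, int vectorBytes, long regionSize) {
        for (int i = 0; i < count; i++) {
            if (crossesRegionBoundary(offsets[i], vectorBytes, regionSize)) {
                return false; // one straddling vector forces one-at-a-time scoring
            }
        }
        return true;
    }

    /** Rough per-batch fallback probability for uniformly placed vectors. */
    static double fallbackProbability(int count, int vectorBytes, long regionSize) {
        double perVector = (double) (vectorBytes - 1) / regionSize;
        return 1.0 - Math.pow(1.0 - perVector, count);
    }

    public static void main(String[] args) {
        long region = 16L << 20; // 16 MiB region, an assumed typical size
        if (crossesRegionBoundary(0, 1024, region)) throw new AssertionError();
        if (!crossesRegionBoundary(region - 100, 1024, region)) throw new AssertionError();
        // e.g. a batch of 32 candidates of 1024 bytes each in 16 MiB regions
        double p = fallbackProbability(32, 1024, region);
        if (p >= 0.01) throw new AssertionError();
        System.out.println("per-batch fallback probability ~" + p);
    }
}
```

Under these assumptions, a 32-vector batch of 1 KiB vectors in 16 MiB regions has roughly a 0.2% chance of hitting a boundary, consistent with the "well under 1%" figure above.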

Key changes:

  • DirectAccessInput.withByteBufferSlices (libs/core): New bulk multi-region zero-copy access method, complementing the single-region withByteBufferSlice from #141718 (Enable zero-copy SIMD vector scoring on searchable snapshots, frozen tier). Implementations in SharedBlobCacheService.CacheFile, FrozenIndexInput, BlobCacheIndexInput, and StoreMetricsIndexInput handle offset adjustment for sliced inputs and graceful fallback (returning false) when regions cross cache boundaries or are not mmap-backed.
  • BULK_GATHER native operation (libs/simdvec/native): New C++ bulk-gather functions for aarch64 and amd64 that accept an array of native memory addresses (one per vector) instead of requiring contiguous memory. Corresponding BULK_GATHER operation plumbing through VectorSimilarityFunctions and JdkVectorLibrary.
  • IndexInputUtils.withSliceAddresses (libs/simdvec): Utility that resolves file byte offsets to native memory addresses, dispatching through MemorySegmentAccessInput (pointer arithmetic) or DirectAccessInput (withByteBufferSlices). Includes reachabilityFence calls to ensure backing memory remains valid during native calls.
  • ByteVectorScorer and Int7SQVectorScorer (libs/simdvec): Refactored to use withSliceAddresses for bulk scoring, supporting both mmap and SNAP inputs through a unified code path. GatherScorer extracted as a shared top-level interface.
  • Test coverage: New tests across SharedBlobCacheServiceTests, FrozenIndexInputTests, BlobCacheIndexInputTests, StoreMetricsIndexInputTests, IndexInputUtilsTests, and ByteVectorScorerFactoryTests covering bulk access, offset adjustment on sliced inputs, cross-region boundary fallback, eviction scenarios, and the super.bulkScore() fallback path.
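The bulk-gather idea in the list above can be modeled in a few lines of plain Java. This is an illustrative stand-in, not the real C++ or FFI code: per-vector "addresses" are simulated as offsets into one backing array in place of raw native pointers, and the inner loop stands in for the SIMD kernel.

```java
// Illustrative model of BULK_GATHER: score `count` candidate vectors against a
// query, reading each candidate through its own address so candidates need not
// be contiguous in memory.
public class BulkGatherSketch {

    static void bulkDotProduct(byte[] backing, long[] addrs, byte[] query, int count, int[] scores) {
        int dims = query.length;
        for (int i = 0; i < count; i++) {
            int base = (int) addrs[i];  // in the native version this is a raw pointer
            int sum = 0;
            for (int d = 0; d < dims; d++) {
                sum += backing[base + d] * query[d];
            }
            scores[i] = sum;            // the SIMD version vectorises this inner loop
        }
    }

    public static void main(String[] args) {
        byte[] backing = new byte[64];
        for (int i = 0; i < backing.length; i++) backing[i] = (byte) i;
        byte[] query = {1, 2, 3, 4};
        long[] addrs = {0, 12, 40};     // scattered, non-contiguous candidate vectors
        int[] scores = new int[3];
        bulkDotProduct(backing, addrs, query, 3, scores);
        // vector at offset 12 is {12,13,14,15}: 12*1 + 13*2 + 14*3 + 15*4 = 140
        if (scores[1] != 140) throw new AssertionError();
        System.out.println(java.util.Arrays.toString(scores));
    }
}
```

The key contrast with the previous contiguous-memory path is visible in the signature: the scorer receives an address per vector rather than a single base pointer plus stride.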

@ChrisHegarty ChrisHegarty added the :Search Relevance/Vectors (Vector search) and Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) labels Mar 19, 2026
@elasticsearchmachine (Collaborator)
Hi @ChrisHegarty, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 19, 2026
elasticsearchmachine and others added 2 commits March 19, 2026 13:00
@ldematte (Contributor) left a comment
I gave it a quick first pass, concentrating especially on the native part and how it interacts (how we pass the dataset). Looks good!

long[] offsets,
int length,
int count,
long[] addrs,
Contributor:
I suppose this is a parameter because we'd likely reuse it, and it's not directly a MemorySegment (of size count * ADDRESS.bytes) because we want to call it from code that does not have the preview things?

ChrisHegarty (Contributor, Author):
yeah. This could be a premature optimisation. Lemme revert it, as it's not clear that it's worth it at this point.

Contributor:
No it's OK I think, just wanted to confirm I understood it correctly

ChrisHegarty added a commit that referenced this pull request Mar 20, 2026
While working on bulk sparse scoring (#144557), I noticed that INT8 and FLOAT32 were missing testBulkIllegalDims coverage that INT7U, INT4, and BBQ already have. Extracting this into a small targeted PR.

Both new tests verify IOOBE for count overflow, negative count, negative dims, and undersized result buffer, matching the existing pattern in JDKVectorLibraryInt7uTests.
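The four failure modes those tests verify can be sketched as a single guard method. This is an illustrative sketch, not the actual simdvec code: `checkBulkArgs` and its parameter names are hypothetical, but each branch mirrors one of the cases listed above (count overflow, negative count, negative dims, undersized result buffer).

```java
// Hypothetical sketch of the argument validation exercised by the new tests:
// every bad input raises IndexOutOfBoundsException before any native call.
public class BulkArgsCheck {

    static void checkBulkArgs(long[] offsets, int count, int dims, float[] results) {
        if (count < 0) throw new IndexOutOfBoundsException("negative count: " + count);
        if (dims < 0) throw new IndexOutOfBoundsException("negative dims: " + dims);
        if (count > offsets.length) throw new IndexOutOfBoundsException("count overflows offsets: " + count);
        if (results.length < count) throw new IndexOutOfBoundsException("undersized result buffer");
    }

    static boolean throwsIOOBE(Runnable r) {
        try {
            r.run();
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long[] offs = new long[4];
        float[] res = new float[4];
        if (!throwsIOOBE(() -> checkBulkArgs(offs, -1, 8, res))) throw new AssertionError();
        if (!throwsIOOBE(() -> checkBulkArgs(offs, 4, -8, res))) throw new AssertionError();
        if (!throwsIOOBE(() -> checkBulkArgs(offs, 5, 8, res))) throw new AssertionError();
        if (!throwsIOOBE(() -> checkBulkArgs(offs, 4, 8, new float[3]))) throw new AssertionError();
        checkBulkArgs(offs, 4, 8, res); // valid arguments pass silently
        System.out.println("all four failure modes rejected");
    }
}
```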
ChrisHegarty added a commit that referenced this pull request Mar 20, 2026
While working on bulk sparse scoring (#144557), I noticed the existing BULK_OFFSETS tests only use random offsets. Random offsets probabilistically cover duplicates and may happen to produce a sequential pattern, but neither case is guaranteed or verified explicitly, so I added two new tests that make the patterns deterministic and assert specific properties that random offsets do not guarantee.

I added these to INT7U only since the offset dispatch logic is the same array_mapper template across all element types. A bug in offset handling would surface here; other type-specific arithmetic is already covered by the existing per-type random-offset tests.
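The two deterministic patterns described above can be sketched as follows. This is an illustrative sketch, not the actual test code: `gather` stands in for the offset-dispatching scorer, and offsets index a plain Java array rather than a native segment.

```java
// Sketch of the two deterministic offset patterns: sequential offsets must
// match a contiguous layout, and all-duplicate offsets must yield the same
// score in every result slot.
public class OffsetPatternsSketch {

    static int dot(byte[] data, int base, byte[] q) {
        int s = 0;
        for (int d = 0; d < q.length; d++) s += data[base + d] * q[d];
        return s;
    }

    static int[] gather(byte[] data, long[] offsets, byte[] q) {
        int[] out = new int[offsets.length];
        for (int i = 0; i < offsets.length; i++) out[i] = dot(data, (int) offsets[i], q);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = new byte[32];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i + 1);
        byte[] q = {1, 1, 1, 1};

        // sequential pattern: offsets 0, 4, 8, ... must match contiguous scoring
        int[] seqScores = gather(data, new long[] {0, 4, 8, 12}, q);
        for (int i = 0; i < seqScores.length; i++) {
            if (seqScores[i] != dot(data, 4 * i, q)) throw new AssertionError();
        }

        // duplicate pattern: every slot points at the same vector
        int[] dupScores = gather(data, new long[] {8, 8, 8, 8}, q);
        for (int s : dupScores) {
            if (s != dupScores[0]) throw new AssertionError();
        }
        System.out.println("both patterns verified");
    }
}
```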
ChrisHegarty added a commit that referenced this pull request Mar 20, 2026
…144645)

While working on bulk sparse scoring (#144557), I noticed that ByteVectorScorerFactoryTests only tested per-ordinal score() via the supplier path. This PR adds bulk scoring and query-side scorer coverage.

The test structure is designed so that SNAP directory variants can be added alongside the MMap tests once DirectAccessInput support lands.
@ChrisHegarty ChrisHegarty marked this pull request as ready for review March 23, 2026 10:39
@ChrisHegarty ChrisHegarty requested a review from a team as a code owner March 23, 2026 10:39
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

* <ol>
* <li>Array of 8-byte longs containing the native memory address of each vector</li>
* <li>Single vector to score against</li>
* <li>Number of dimensions, or for bbq, the number of index bytes</li>
Member:
this isn't for BBQ (yet?)

ChrisHegarty (Contributor, Author):
I just didn't write the native code for it yet, but given how this is progressing - the native mapper template should be trivial. lemme take a look.

ChrisHegarty (Contributor, Author):
BBQ can use a similar technique, but the code is a bit more involved. Let's do it as a follow up.

Contributor:
Do we need to do this for BBQ/DiskBBQ? I think that in that case data is always contiguous...

@thecoop (Member) left a comment
A few test tweaks, but otherwise vector side looks good

@ChrisHegarty ChrisHegarty changed the title Add bulk-gather native vector scoring for searchable snapshots via DirectAccessInput Add bulk-sparse native vector scoring for searchable snapshots via DirectAccessInput Mar 24, 2026
@ChrisHegarty ChrisHegarty enabled auto-merge (squash) March 25, 2026 17:43
@ChrisHegarty ChrisHegarty disabled auto-merge March 25, 2026 17:43
@ChrisHegarty ChrisHegarty merged commit a33042d into elastic:main Mar 25, 2026
33 of 53 checks passed
ChrisHegarty added a commit that referenced this pull request Mar 26, 2026

While working on bulk sparse scoring (#144557), I noticed that checkBulkOffsets and checkBBQBulkOffsets validated segment sizes but not individual offset values. An out-of-range or negative offset would silently read memory beyond the data segment, risking a crash or silently wrong results.

The solution is to replace the sequential size check with per-offset validation that checks each offset points to a valid vector within the data segment. The O(count) loop should be negligible relative to the O(count * dims) native call, but we've made the checks conditional on asserts to avoid any potential negative cost of this, and asserts should be good enough given our testing.

Note: INT4 skips size=2 (packedLen=1) because checkBulkOffsets computes rowBytes = packedLen * 4 / 8 which truncates to 0 via integer division, making the bounds check trivially pass. This is a pre-existing issue with how INT4 passes packed byte length (not element count) as the length parameter to the generic check formula. We can address this separately, if needed.
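The INT4 edge case described above comes down to one integer-division truncation. The sketch below is illustrative (the helper names are hypothetical, only the `packedLen * 4 / 8` formula is taken from the text): with packedLen = 1 the computed row size is 0, so an offset-plus-rowBytes bounds check passes trivially even at the very end of the data.

```java
// Demonstrates why checkBulkOffsets skips size=2 (packedLen=1) for INT4:
// rowBytes = packedLen * 4 / 8 truncates to 0 under integer division.
public class Int4RowBytes {

    static int rowBytes(int packedLen) {
        return packedLen * 4 / 8; // integer division truncates toward zero
    }

    static boolean inBounds(long offset, int rowBytes, long dataLength) {
        return offset >= 0 && offset + rowBytes <= dataLength;
    }

    public static void main(String[] args) {
        if (rowBytes(1) != 0) throw new AssertionError(); // packedLen=1 -> rowBytes 0
        if (rowBytes(2) != 1) throw new AssertionError();
        // with rowBytes == 0, an offset at the very end of the data still "fits"
        if (!inBounds(100, rowBytes(1), 100)) throw new AssertionError();
        // with a real row size, the same offset is correctly rejected
        if (inBounds(100, rowBytes(2), 100)) throw new AssertionError();
        System.out.println("INT4 truncation makes the bounds check trivially pass");
    }
}
```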
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
mamazzol pushed two commits to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026

Labels

  • >enhancement
  • :Search Relevance/Vectors (Vector search)
  • serverless-linked (Added by automation, don't add manually)
  • Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
  • test-arm (Pull Requests that should be tested against arm agents)
  • v9.4.0


5 participants